[57] ABSTRACT

A method is described for compiling a source code listing into an object code listing and comprises the steps of: extracting a block of source code statements from a source code listing; mapping each source code statement in the block into a wide intermediate code (WIC) statement in object form, a WIC statement defining a series of machine actions to perform the function(s) called for by the source code statement; performing an initial approximate simulation of each WIC statement in a block and deriving performance results from the simulation of each WIC statement and the block of WIC statements; dependent upon the performance results, revising the WIC statements in the block in accordance with one of a group of code transform algorithms and heuristics in an attempt to improve the code's performance results; and repeating the approximate simulation to determine if the performance results have been improved and, if so, proceeding to another of the algorithms to enable further revision of the WIC statements, until a decision point is reached, and at such time, producing the revised WIC statements in object code form.

11 Claims, 8 Drawing Sheets

OPTIMIZING COMPILER FOR COMPUTERS

This is a continuation of application Ser. No. 07/705,331, filed on May 24, 1991, now abandoned.

FIELD OF THE INVENTION 

This invention relates to source code compilers, and, more particularly, to a system for transforming source code to an intermediate code and, on-the-fly, reconfiguring the intermediate code to optimize its performance before it is transformed into object code.

BACKGROUND OF THE INVENTION 

A compiler is a program containing multiple routines for translating source code into machine (object) code. In general, compilers take a high level source language (e.g., C, Fortran, etc.) and translate it into a sequential, intermediate format code. A dependency analysis is performed on the intermediate code statements. That analysis determines which operands are required to produce a given result and allows those operands to be available in the correct sequence and at the correct time during the processing operation. Subsequently, the compiler goes through a general optimizing routine which transforms the intermediate code statements into a subsidiary intermediate form characterized by a more compact format. For instance, "dead" code is removed, common subexpressions are eliminated, and other compaction techniques are performed. These optimization actions are essentially open loop, in that the code is subjected to a procedure and is then passed on to a next optimization procedure without there being any intermediate testing to determine the effectiveness of the optimization. Subsequently, the optimized code statements are converted into machine language (object code). In general, such compiled code is directly run and is not subjected to a performance metric to determine the efficiency of the resulting object code.

In summary, compiler optimization procedures are basically open loop, in that they select individual statements in the intermediate code string and pass those statements through a list of optimization procedures. Once the procedures have been completed, the code is converted to object code and is not subjected to a further performance measure.

Recently, with the advent of highly parallel computers, compilation tasks have become more complex. Today, the compiler needs to assure both efficient storage of data in memory and subsequent availability of that data from memory, on a nearly conflict-free basis, by the parallel processing hardware. Compilers must therefore address the fundamental problem of data-structure storage and retrieval from the memory subsystems with the same degree of care associated with identification and formation of vector/parallel code constructs.

Vector processors and systolic arrays are of little use if the data becomes enmeshed in traffic jams at both ends of the units. In order to achieve nearly conflict-free access, it is not sufficient to run intermediate code through an optimization procedure and "hope" that its performance characteristics have been improved. Furthermore, it is inefficient to fully compile/optimize a complex source code listing and then be required to compare the resulting object code's performance against performance metrics, before determining whether additional code transformations are required to achieve a desired performance level.

The prior art regarding compiler optimization is characterized by the following articles which have appeared over the years. Schneck et al. in "Fortran to Fortran Optimizing Compiler", The Computer Journal, Vol. 16, No. 4, pp. 322-330 (1972) describe an early optimizer directed at improving program performance at the source code level, rather than at the machine code level. In 1974, Kuck et al. in "Measurements of Parallelism in Ordinary Fortran Programs", Computer, January 1974, pp. 37-46 describe some early efforts at extracting from one program as many simultaneously executable operations as possible. The purpose of that action was to improve the performance of a Fortran program by enabling certain of its operations to run in parallel.

An optimization procedure for conversion of sequential microcode to parallel or horizontal microcode is described by Fisher in "Trace Scheduling: A Technique for Global Microcode Compaction", IEEE Transactions on Computers, Vol. C-30, No. 7, July 1981, pp. 478-490.

Heavily parallel multiprocessors and compilation 
techniques therefor are considered by Fisher, in "The 
VLIW Machine: A Multiprocessor for Compiling Sci- 
entific Code" Computer, July 1984, pp. 45-53 and by 
Gupta et al. in "Compilation Techniques for a Recon- 
figurable LIW Architecture", The Journal of Super- 
computing, Vol. 3, pp. 271-304 (1989). Both Fisher and 
Gupta et al. treat the problems of optimization in highly 
parallel architectures wherein very long instruction 
words are employed. Gupta et al. describe compilation 
techniques such as region scheduling, generational code 
for reconfiguration of the system, and memory alloca- 
tion techniques to achieve improved performance. In 
that regard, Lee et al. in "Mapping Nested Loop Algorithms
into Multidimensional Systolic Arrays", IEEE
Transactions on Parallel and Distributed Systems, Vol. 
1, No. 1, January 1990, pp. 64-76 describe how, as part 
of a compilation procedure, loop algorithms can be 
mapped onto systolic VLSI arrays. 

The above-cited prior art describes open-loop optimization procedures. Specifically, once the code is "optimized", it is converted into object code and then output for machine execution.

Accordingly, it is an object of this invention to pro- 
vide an improved system for compiling source code, 
wherein optimization procedures are employed. 

It is another object of this invention to provide an 
improved compiler wherein code transformed during 
an optimization procedure is immediately tested to de- 
termine if the conversion has improved its performance. 

It is another object of this invention to provide a 
compiler that effectively enables allocation of data 
structures to one or more independent memory spaces 
(domain decomposition) to permit parallel computation 
with minimum subsequent memory conflicts. 

SUMMARY OF THE INVENTION 

A method is described for compiling a source code listing into an object code listing and comprises the steps of: extracting a block of source code statements from a source code listing; mapping each source code statement in the block into a wide intermediate code (WIC) statement in object form, a WIC statement defining a series of machine actions to perform the function(s) called for by the source code statement; performing an initial approximate simulation of each WIC statement in a block and deriving performance results from the simulation of each WIC statement and the block of WIC statements; dependent upon the performance results, revising the WIC statements in the block in accordance with one of a group of code transform algorithms and heuristics in an attempt to improve the code's performance; and repeating the approximate simulation to determine if the performance results have been improved and, if so, proceeding to another of the algorithms to enable further revision of the WIC statements, until a decision point is reached, and at such time, producing the revised WIC statements in object code form.

DESCRIPTION OF THE DRAWINGS

FIG. 1 is a partial block diagram of a prior art arithmetic-logic unit pipeline organization and switching network which allows for a change in the configuration of substructures making up the arithmetic-logic unit.

FIG. 2 is a block diagram of a "triplet" processing substructure which is employed in the system of FIG. 1.

FIG. 3 is a block diagram of a "doublet" processing substructure employed in the system of FIG. 1.

FIG. 4 is a block diagram of a "singlet" processing substructure employed in the system of FIG. 1.

FIGS. 5-8 illustrate a high level flow diagram of the method of the invention.

Prior to describing the details of the method of the compiler invention disclosed herein, the structure of a computer particularly adapted to execute the method will be first considered. Details of the to-be-described computer are disclosed in U.S. Pat. No. 4,811,214 to Nosenchuck et al. and assigned to the same Assignee as this application. The disclosure of the '214 patent is incorporated herein by reference.

In the '214 patent, a highly parallel computer is described which employs a small number of powerful nodes, operating concurrently. Within any given node, the computer uses many functional units (e.g., floating point arithmetic processors, integer arithmetic/logic processors, special purpose processors, etc.), organized in a synchronous, dynamically-reconfigurable pipeline such that most, if not all, of the functional units are active during each clock cycle of a given node.

Each node of the computer includes a reconfigurable arithmetic/logic unit (ALU), a multiplane memory and a memory-ALU network switch for routing data between memory planes and the reconfigurable ALU. In FIG. 1, a high level block diagram shows a typical ALU pipeline switching network along with a plurality of reconfigurable substructures which may be organized to provide a specifically-called-for pipeline processing structure. Each reconfigurable pipeline processor 10 is formed of various classes of processing elements (or substructures) and a switching network 11. Three permanently hardwired substructures 12, 14, and 16 are each replicated a specific number of times in an ALU pipeline processor and are adapted to have their relative interconnections altered by switching network 11.

Substructure 12 is illustrated in further detail, at the block diagram level in FIG. 2 and will hereafter be called a triplet. Each triplet contains three register files 20, 22, and 24; three floating point units 26, 28, and 30; and one integer logical unit 32. There are four inputs to the triplet, two each to the register file-floating point unit pairs (e.g., 20, 26 and 22, 28). There is a single output 34. Inputs to the triplets pass through switch 11 (FIG. 1) but may come from memory or the output of another arithmetic/logic structure. Outputs from each triplet go to switch 11 from where they may be directed to the input of another arithmetic/logic structure or to memory.

Other processing substructures are shown in FIGS. 3 and 4, with FIG. 3 illustrating a doublet substructure 36 and FIG. 4 illustrating a singlet substructure 38. It can be seen that each of the aforesaid substructures is a subset of the triplet structure shown in FIG. 2 and ac[...] '214 patent operates with a very-long instruction word (VLIW) having hundreds of fields. Portions of the fields of each VLIW define the processing structure required to be configured for each action of the computer. In effect, those portions of the VLIW commands create the reconfiguration of the system to enable the required processing functions to occur.

As above stated, the computer contains a substantial number of independent processors, each of which handles either an entire subroutine or a portion of a subroutine, in parallel with other processors. As a result, it can be seen that with the highly reconfigurable nature of each computer node in combination with the VLIW structure, great flexibility is available in the handling of processing of complex problems. However, along with this flexibility comes a cost, and that is the difficulty of assuring that the computer executes its code in the least expended time. The structuring of the system's object code is accomplished by a compiler which performs the method broadly shown in the flow diagrams of FIGS. 5-8.

Referring now to the flow diagram in FIG. 5, the compiler receives as inputs, a source code listing (box 50) and data defining certain system and operating parameters. As shown in box 52, system architectural parameters form one input to the compiler and define, for each node, the available system assets and certain specifications with respect thereto. For instance, memory will be defined as to its organization (e.g., number of planes, whether they are physical or virtual, capacity of the cache, organization of the cache and its operating algorithm, number of reads per clock, writes per clock, and accesses per clock). Further, each processor will have defined for it the number of available singlets, doublets, and triplets (e.g., 4, 8, and 4 respectively), the number of register files and registers in each, the type of access, and whether any special functions are provided for in the processor. Clearly, additional architectural parameters will be provided, however the above provides one skilled in the art with a ready understanding of the type of information that defines system assets and their operating characteristics.

Optimization parameters are provided as inputs to the compiler (box 54) which, among other specifications, indicate the number of times the optimization subroutine should be traversed before a time-out occurs and an exit is commanded.

The source code listing is subjected to an initial memory map subroutine (box 56) which is comprised of a parser and lexical analyzer. These subroutines, along with a pre-optimization symbol-table generator, allocate memory locations of input memory arrays and build a table specifying those locations. More specifically, as source code statements are received which define array sizes, large arrays are allocated (decomposed) into differing physical locations (domains) in an early attempt to avoid subsequent memory reference conflicts.

It will be hereafter assumed that the source code listing includes both systolic do-loop statements as well as lists of scalar statements. Hereafter, the term "block" may either refer to a group of scalar statements or to those statements comprising a do-loop (e.g., a systolic code block). As is known, each do-loop is a vector process which defines an iterative operation to be performed on a set of operands. The compiler extracts the do-loops from the incoming source code input stream (box 58) and converts each source code statement therein to a Wide Intermediate Code (WIC) statement (box 60). The wide intermediate code differs from that generated by "conventional" vectorizing compilers, in that the format represents a higher level of specification than is typical with sequential code, with immediate local dependencies embodied within the WIC.

Each WIC statement defines a series of machine actions to perform the function called for by the source code statement. Each WIC statement is in object code form and comprises a chain of symbols, tokens, etc. which substantially define the actions called for by the source code statement. It is not executable by the computer at this stage as it lacks certain linking information.

In essence, the format of the WIC inherently maintains local parallel and systolic constructs and dependencies found in the original source code. The natural relationships between operand fetches, complex intermediate operations, and result storage are preserved within the WIC statements. A single line of WIC code often relates directly to corresponding lines in the source code. The burden on subsequent analysis to extract possible parallel or systolic implementations is lessened.

WIC may be contrasted to ubiquitous sequential internal code-formats, typically characterized by simple load, move, operate, store sequences. This latter format places an increased burden on the parallel code analyzer which must reconstruct many of the "obvious" parallel code elements that were explicit in the original source code.

The WIC code embodies all of the actions directed by the source program. In addition, it maintains symbol-table attributes and local data dependencies.

An example of the basic format of a WIC statement is:

Result = (Oper1 O1 Oper2) O2 (Oper3 O3 Oper4)

where O signifies an arbitrary high-level operation, such as +, -, ×, ÷, and Oper signifies an operand, either from memory, a register, or from the result of a preceding computation. In this example, the WIC shows the local dependencies (based on parenthetical ordering), where O1 and O3 may execute in parallel, with subsequent processing by O2. Systolic execution is operationally defined by considering data streams that enter an array of processing elements whose outputs are directly fed into subsequent processor inputs. Data is thus processed in an assembly line fashion, without the need for intermediate storage.

To illustrate an example of the format of the intermediate code, consider a systolic vector operation as extracted from a test program and expressed in Fortran as follows:

[...] = const1 - (const2*a(i) + const3*c(i,j))

The WIC statement corresponding to the above Fortran statement is as follows:

URS * S ' #5MA ' +W1UISW.MAI *          % c2*a(i)+c3*c(i,j)
==SB#i2MAi--$7#iiRS*$2#6MAi-pOO         % calculate 2
~>                                      % end do

The dependency analyzer which creates the WIC above closely follows the format of the source-code. The WIC uses symbol-table mnemonics particular to this embodiment of the compiler. The potential for multiple independent memory planes is reflected in the structure of the token. The format of the data-structure symbol table tokens is described in Table 1.

Data-structure symbol-table tokens: $ ! xx #mp stor occ, where:

TABLE 1
Format of Data-structure Symbol Tokens

Symbol    Explanation
$         data element
!         scatter/multistore indicator
xx        MA: memory-based array
          MS: memory-based scalar
occ

As shown in the above Example, a WIC is essentially comprised of nested interior dependency nodes (within a do-loop). (The loop header code was eliminated for simplicity). Here nonterminal internal node pOO indicates a systolic phase. The phase-break is driven by the parenthetical ordering indicated in the source. The token bounded by == is the root of the local dependency tree. Thus, as illustrated by this example, the WIC presents a natural ordering of intermediate and final results that lend themselves to relatively straight-forward subsequent analysis and parallel implementation. The inherited attributes can be parsed much finer where, in the limit, conventional sequential intermediate code, as discussed above, would result. However, this would require increased work from the parallel code analyzer, and might result in lower parallel performance relative to that expected by WIC analysis. It should be noted that WIC ordering does not significantly constrain additional systolic and parallel code generation, which is performed by the optimizer.

Returning now to FIG. 5, after each do-loop WIC statement is constructed (box 60), internal dependencies within the statement are found and recorded. As can be seen from boxes 64, 66, and 68, similar acts occur with respect to blocks of scalar source code statements. In this instance however, the block size is the minimum of either the number of lines of code between successive do-loops, or a predefined maximum number of lines of code.

Once both the do-loops and blocks of scalar statements are converted to WIC statements, those statements are merged (box 70) into a list. Then, each WIC statement is analyzed to determine which architectural
assets are required to enable it to function. Those assets are then allocated (box 72, FIG. 6), and the WIC statement is mapped onto the architectural units to produce the necessary systolic or scalar array that is able to process the statement (box 74).

At this stage, the compiler has generated a map which, in combination with the allocated architectural assets, enables configuration of a computational system to perform the WIC statement. However, if insufficient assets are available to simulate the WIC statement (see decision box 73), the WIC statement must be revised (box 73) to accommodate the available assets. This may take the form of a statement truncation, a split or some other procedure which divides the WIC statement operations into succeeding steps and thus reduces the required assets at each step.

If sufficient assets are found available for the WIC statement, an array of assets is assembled and the statement is mapped thereon (box 75).

Now, a simulation subroutine (Box 76) is accessed and runs an "approximate" simulation of the assembled architectural unit in accordance with the mapped WIC statement. The goal of the simulation (and of subsequent optimizations) is to generate object code which will run at, or greater than a specified fraction of the computer's peak theoretical speed. This is achieved by obtaining an approximate measure of how efficiently each WIC statement executes and then modifying the WIC statement in an attempt to improve its performance. It has been found unnecessary to fully simulate each WIC statement to obtain this result. In effect, therefore a relatively crude simulation is performed of each WIC statement and such simulation still enables the compiler to arrive at a measure of its execution efficiency.

In real (non-simulated) operation, each WIC statement and code generated therefrom acts upon large arrays of data. The simulation subroutine selects from the large array, a small subset thereof to act as inputs for the subsequent simulation. This prevents the simulator from bogging down as a result of being required to handle greater amounts of data than needed to derive performance criteria that exhibits statistical validity. The data array to be used in the simulation is user specified (or in the absence of a specification, a default subset).

The approximate simulator "executes" for each WIC statement, all called-for memory references. Memory references include all references, whether read or write, to each storage array in memory. In addition, computational actions called for by the WIC statement are simulated, but only in part. For instance, computational actions which pertain to the computing of memory references are simulated. The simulation generally only executes those statements which lead to a subsequent memory reference or any statement which is subjected to a following conditional test.

As an example, consider a reference to an address stored in another part of memory which must be calculated, but is dependent upon an indirect address calculation. In this instance, the address is specified within array A by the value i where i refers to the indirect memory reference. Here, the computation of the value of i is simulated, but not the value of A. This is because A is simply a "result" and the simulation is only concerned with how the computation of A(i) affects the machine's performance and not the result or answer to the calculation. If, however, the value of A is subject to a following conditional test (for instance, is A greater than or less than 1), it will be calculated.

As the simulation program runs (see Box 78, FIG. 7), it records the number of memory references; the number of primitive arithmetic/logic operations performed; and the number of memory conflicts which occur. A primitive arithmetic/logic operation is an add, subtract, multiply, or logical compare. More complex operations are represented by a scaled value of another primitive. For instance, an add operation is equal to 1 whereas a divide operation is equal to 4.

The number of memory conflicts are recorded for each block of memory so as to enable a subsequent reallocation of stored arrays within the memory banks. Other statistics which may be generated by the simulator include cache-misses per plane relative to the number of fetch/restore operations per plane, number of conditional pipeline flushes, and number of reconfigurations as a function of conditional statement executions, etc.

The simulation is performed not to arrive at final numerical or logical results but rather to meter the operation of the computer and its allocated assets in the performance of each WIC statement. Thus the result of the simulation is a set of statistically reliable approximate performance characteristics.

Once the simulation of a WIC statement is complete, the algorithm tests (box 80) whether the block or do-loop is finished (i.e., are there more statements which have not been simulated?). If statements do remain to be simulated, the program recycles to accomplish the simulation. If all WIC statements in a block or do-loop have been simulated, the program proceeds and outputs, among other indications, an operation count for the block or do-loop; an operation count for each WIC statement in the block or do-loop; and a count of memory reference conflicts, including a list of memory banks and conflicts for each (box 82).

As shown in decision box 84 in FIG. 8, those outputs are compared to pre-defined operating criteria (accessed from the optimization parameters, see box 54 in FIG. 5). If it is found that the do-loop or block of scalar statements executes at an efficiency level greater than the operating criteria, then the WIC statements are converted (box 86) to object code. The program exits if there are no further WIC statements to be processed (decision box 88), otherwise the method recycles to handle the next WIC statement. On the other hand, if the outputs indicate a performance efficiency which is less than the called parameters (decision box 84), it is determined if the performance efficiency has improved over the last "try" (decision box 90). If no performance improvement resulted, the last optimization action is reversed (box 92) and an untried optimization action is attempted. If there was a performance efficiency improvement (box 90), the method proceeds to another untried optimization action (Box 94).

The compiler proceeds with the optimizer subroutine by performing discrete optimization actions on each do-loop or block of scalar WIC code, as the case may be. It performs both code transforms and code heuristics in a serial fashion. For instance, known code transforms are performed, such as the elimination of global common subexpressions, detection and subsequent scalar processing of loopheader operations, etc. The compiler also performs code heuristics which may include, but not be limited to, loop fusions to enable redistribution of unused processing assets; loop interchanging to minimize memory conflicts; dynamic redistribution of data in memory to reduce conflicts; loop splits to allow independent and parallel sub-loop executions, etc.
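As a rough illustration of why loop interchanging can reduce memory conflicts, the following sketch counts back-to-back references to the same memory plane for two loop orders over a row-major array. The plane mapping (address modulo the number of planes), the conflict rule, and every name in the code are assumptions made for illustration only; the description does not specify them.

```python
def count_plane_conflicts(addresses, n_planes):
    """Count consecutive references that land on the same memory plane.

    Assumed conflict model: a conflict occurs when two back-to-back
    references map to the same plane (plane = address % n_planes).
    """
    conflicts = 0
    prev_plane = None
    for addr in addresses:
        plane = addr % n_planes
        if plane == prev_plane:
            conflicts += 1
        prev_plane = plane
    return conflicts


def loop_addresses(rows, cols, inner):
    """Addresses touched by a doubly nested loop over a row-major array."""
    if inner == "col":
        # Inner loop walks along a row: stride 1.
        return [i * cols + j for i in range(rows) for j in range(cols)]
    # Inner loop walks down a column: stride equal to cols.
    return [i * cols + j for j in range(cols) for i in range(rows)]


N_PLANES = 4  # hypothetical number of memory planes
before = count_plane_conflicts(loop_addresses(8, 8, inner="row"), N_PLANES)
after = count_plane_conflicts(loop_addresses(8, 8, inner="col"), N_PLANES)
print("conflicts before interchange:", before)
print("conflicts after interchange:", after)
```

With a stride that is a multiple of the plane count, every inner-loop reference hits the same plane; interchanging the loops restores stride-1 access and spreads references across planes under this assumed mapping.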

After each discrete optimization action is accomplished (box 94), the revised WIC statements in either the do-loop or block of scalar statements are again simulated using the approximate simulation subroutine. Thus, each do-loop or block of WIC statements has its execution simulated, as above described, to obtain a new set of outputs for comparison to the predefined operating criteria. Then it is determined whether the number of runs of the optimizer routine equals a limit (decision box 96) and, if so, the optimizer routine stops, exits, and the WIC statements are converted to object code. If the run number has not been equaled, the routine cycles back and continues.
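The closed-loop control flow described above (simulate, compare against the operating criteria, reverse an action that did not help, try another, stop at a run limit) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names, the toy block representation (operation count, conflict count), and the efficiency formula are all hypothetical.

```python
def optimize_block(block, actions, simulate, target, max_runs):
    """Try optimization actions serially, keeping only those that improve
    the simulated efficiency; stop when the target or the run limit is hit."""
    best_score = simulate(block)
    runs = 0
    for action in actions:                 # each action is tried at most once
        if best_score >= target or runs >= max_runs:
            break
        candidate = action(block)          # discrete optimization action
        score = simulate(candidate)        # re-run the approximate simulation
        runs += 1
        if score > best_score:
            block, best_score = candidate, score   # improvement: keep it
        # Otherwise the candidate is discarded, which reverses the action.
    return block, best_score, runs


# Toy stand-ins (hypothetical): a block is (operations, conflicts).
def simulate(block):
    ops, conflicts = block
    return ops / (ops + conflicts)         # assumed efficiency measure


def remap_data(block):                     # assumed to halve the conflicts
    ops, conflicts = block
    return (ops, conflicts // 2)


def bad_fusion(block):                     # assumed to add conflicts
    ops, conflicts = block
    return (ops, conflicts + 10)


final, score, runs = optimize_block(
    (100, 40), [bad_fusion, remap_data, remap_data],
    simulate, target=0.9, max_runs=5)
print(final, round(score, 3), runs)
```

In this toy run the first action lowers the simulated efficiency and is reversed, while the two remapping actions are kept because each improves the score.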

As can be seen from the above, the compiler enables the individual do-loops and blocks of scalar statements to be individually tested using a crude simulation. The optimized code statements are subjected to additional optimization subroutines, in an attempt to improve further the code's performance. It will be obvious to one skilled in the art that, subsequent to each optimization subroutine, the recorded internal dependencies of each block and/or do-loop must be reexamined and readjusted in accordance with the altered WIC code statements. This procedure not only optimizes individual blocks of code and do-loops, but also may have global effects on the entire code structure. For instance, if several do-loops are "fused" into a single loop, the dependencies within and among the do-loops are considered and altered, as necessary. Also redistribution or remapping of data has a similar global effect. Thus, a real-time optimization occurs which is tested, at each step, to assure that the object code being produced by the compiler is as optimum as can be produced, based on the code transforms and heuristics employed.
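The statistics gathered by the crude simulation (memory references, weighted primitive operations, and conflicts per memory bank) can be illustrated with a small metering routine. Only the weights of 1 for an add and 4 for a divide come from the description; the unit weight for the other primitives, the event-trace format, the bank mapping, and the conflict rule are assumptions for illustration.

```python
from collections import Counter

# add = 1 and div = 4 follow the description; the rest are assumed weights.
OP_WEIGHTS = {"add": 1, "sub": 1, "mul": 1, "cmp": 1, "div": 4}


def meter_statement(trace, n_banks):
    """Meter one statement's trace without computing any numerical result.

    trace: list of ("mem", address) or ("op", name) events (hypothetical format).
    A conflict is assumed when consecutive references hit the same bank.
    """
    mem_refs = 0
    weighted_ops = 0
    conflicts = Counter()
    prev_bank = None
    for kind, value in trace:
        if kind == "mem":
            mem_refs += 1
            bank = value % n_banks          # assumed bank mapping
            if bank == prev_bank:
                conflicts[bank] += 1        # recorded per bank
            prev_bank = bank
        elif kind == "op":
            weighted_ops += OP_WEIGHTS[value]
    return {"mem_refs": mem_refs,
            "weighted_ops": weighted_ops,
            "conflicts": dict(conflicts)}


trace = [("mem", 0), ("mem", 4), ("op", "mul"),
         ("mem", 1), ("op", "div"), ("op", "add"), ("mem", 5)]
stats = meter_statement(trace, n_banks=4)
print(stats)
```

Keeping the conflict counts per bank, rather than as a single total, mirrors the stated purpose of enabling a later reallocation of arrays among the memory banks.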

It should be understood that the foregoing description is only illustrative of the invention. Various alternatives and modifications can be devised by those skilled in the art without departing from the invention. Accordingly, the present invention is intended to embrace all such alternatives, modifications and variances which fall within the scope of the appended claims.

What is claimed is: 

1. A method for compiling a source code listing into an object code listing, said method performed by a computer with compiler software that controls the operation of said computer, said method comprising the steps of:

a. extracting a block of source code statements from said source code listing;

b. mapping each source code statement in said block into a common intermediate code format which defines a dependent series of machine actions to perform function(s) called for by mapped source code statements;

c. using a subset of an input data array required to enable full operation of said source code listing, executing an approximate simulation of said intermediate code format into which said block of source code statements were mapped in step a, and deriving performance results from said approximate simulation, an approximate simulation generally only executing statements which lead to a subsequent memory reference and any statement which is subjected to a following conditional test;

d. dependent upon a measure of said performance results, revising said intermediate code format approximately simulated in step c in an attempt to improve said performance results;

e. repeating steps c and d until a decision point is reached, said decision point occurring upon achievement of a determined condition.

2. The method as recited in claim 1 wherein said 
intermediate code format includes wide intermediate 
code (WIC) statements and wherein said revising of said 
intermediate code format in step d is performed on said 
WIC statements in accordance with a routine selected 
from a group of code transforms and heuristics. 

3. The method as recited in claim 2 wherein step b 
comprises the further step of: 

determining dependencies of procedures defined by
code within each said WIC statement in said block 
and subsequently causing said initial approximate 
simulation to operate in accordance with said de- 
pendencies. 

4. The method as recited in claim 3 wherein step c 
further comprises: 

c1. providing a list of architectural parameters that
define available processing elements in said com- 
puter; 

c2. allocating a set of said available processing ele- 
ments to implement a WIC statement; and 

c3. operating an approximate simulation routine to 
process said WIC statement through said allocated 
processing elements. 

5. The method as recited in claim 4 wherein said 
approximate simulation provides a crude simulation of 
each WIC statement, said crude simulation including 
the sub-steps of measuring a number of arithmetic/logic 
primitive operations, per clock cycle, per WIC state- 
ment, a number of memory reference conflicts per WIC 
statement, and accumulating numbers of said operations 
and conflicts for each block. 

6. The method as recited in claim 5 wherein step d 
further comprises: 

d1. comparing said performance results against operating
criteria for said computer, and performing
step e if results of said comparing do not indicate 
that said results are at least equal to said operating 
criteria. 

7. The method as recited in claim 6 wherein step d 
further comprises: 

d2. after revising said WIC statements, updating de- 
termined dependencies in accordance with revised 
WIC statements. 

8. The method as recited in claim 7 wherein step d 
further comprises: 

d3. revising memory address allocations for said
block of WIC statements in an attempt to reduce 
memory reference conflicts. 

9. The method as recited in claim 8 wherein step d 
further comprises: 

d4. dividing strings of said primitive operations to
enable parallel execution thereof by available
processing elements.

10. The method as recited in claim 9 wherein step d 
further comprises: 

d5. subsequent to revising said WIC statements, assessing
whether sufficient non-allocated processing
elements are available to implement the revised
said WIC statements and if not, revising said WIC
statements to utilize available processing elements.

11. The method as recited in claim 10 wherein said 
decision point in step (e) is a parameter entered by the 
user. 

*****